Introduction

Materials and methods

Materials


Kaggle breast cancer data

26 explanatory variables.

  • Condition
  • Tumor type
patient_id gender education treatment_data id_healthcenter id_treatment_region hereditary_history birth_date age weight thickness_tumor marital_status marital_length pregnency_experience giving_birth age_FirstGivingBirth abortion blood taking_heartMedicine taking_blood_pressure_medicine taking_gallbladder_disease_medicine smoking alcohol breast_pain radiation_history Birth_control(Contraception) menstrual_age menopausal_age Benign_malignant_cancer condition
111036008041 0 4 2019 1.11e+09 1.11e+09 1 1989 30 69 0.90 1 0 0 0 0 0 4 0 1 1 0 0 1 1 1 1 0 1 death
111035996130 0 6 2019 1.11e+09 1.11e+09 0 1989 30 71 0.80 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 2 0 0 death
111035971333 0 5 2019 1.11e+09 1.11e+09 0 1989 30 74 0.90 1 0 0 0 0 1 4 1 1 0 0 0 1 1 0 1 0 1 death
111036018485 0 5 2019 1.11e+09 1.11e+09 1 1989 30 75 0.70 1 1 1 3 1 0 2 1 1 1 1 0 0 0 0 2 0 0 death
111035985474 0 1 2019 1.11e+09 1.11e+09 0 2009 10 70 0.25 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 death
111035903616 0 3 2019 1.11e+09 1.11e+09 1 1989 30 79 0.70 0 0 0 0 0 0 6 1 1 1 0 1 1 1 1 1 0 1 death
111036003507 0 4 2019 1.11e+09 1.11e+09 1 1990 29 96 0.10 0 0 0 0 0 0 4 1 1 0 0 0 1 1 0 2 0 1 death
111036026259 0 5 2019 1.11e+09 1.11e+09 0 1990 29 75 0.80 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 2 0 0 death

Process


Cleaning the data


Unrealistic age and weight proportions.


Changes

Column names
  • Removed special characters from variable names

Variable values
  • Greedy cleanup of binary variables: non-binary values set to 1
  • Blood type had one erroneous value of 44, which was changed to NA
  • Birth dates not consisting of exactly four digits were set to NA

Filtering out samples
  • Only women are included, as they are the main risk group for breast cancer (few men are affected)
  • Only women older than 20 years are included, as they are the primary risk group
  • Samples with abnormal weight-to-age proportions were removed
  • A single woman recorded as not yet having had her first period, but with pregnancy experience, was removed

Removing columns
  • Constant columns (a single value for all samples) were removed
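
The value-level rules and the age and gender filters above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's own code (the learning objectives name Tidyverse R). The example rows are made up, `BINARY_COLUMNS` is only an illustrative subset of the binary columns, and the coding `0 = female` is an assumption that the slides do not state.

```python
from typing import Optional

# Illustrative subset of the binary columns (assumption: the real list is longer).
BINARY_COLUMNS = ["hereditary_history", "smoking", "alcohol", "breast_pain"]

# ASSUMPTION: gender code 0 means female; the slides do not state the coding.
FEMALE = 0


def clean_binary(value: int) -> int:
    """Greedy cleanup: any value other than 0 or 1 is set to 1."""
    return value if value in (0, 1) else 1


def clean_blood(value: int) -> Optional[int]:
    """The erroneous blood type code 44 becomes missing (None stands in for NA)."""
    return None if value == 44 else value


def clean_birth_year(value) -> Optional[int]:
    """Birth dates that are not exactly four digits become missing."""
    text = str(value)
    return int(text) if len(text) == 4 and text.isdigit() else None


def keep_sample(row: dict) -> bool:
    """Keep only women older than 20 years."""
    return row["gender"] == FEMALE and row["age"] > 20


def clean_rows(rows: list) -> list:
    cleaned = []
    for row in rows:
        row = dict(row)  # copy, so the input rows are left untouched
        for col in BINARY_COLUMNS:
            row[col] = clean_binary(row[col])
        row["blood"] = clean_blood(row["blood"])
        row["birth_date"] = clean_birth_year(row["birth_date"])
        if keep_sample(row):
            cleaned.append(row)
    return cleaned


# Made-up example rows using column names from the data set.
rows = [
    {"gender": 0, "age": 30, "blood": 4, "birth_date": 1989,
     "hereditary_history": 1, "smoking": 0, "alcohol": 0, "breast_pain": 1},
    {"gender": 0, "age": 10, "blood": 7, "birth_date": 2009,
     "hereditary_history": 0, "smoking": 0, "alcohol": 0, "breast_pain": 0},
    {"gender": 0, "age": 45, "blood": 44, "birth_date": 74,
     "hereditary_history": 0, "smoking": 3, "alcohol": 0, "breast_pain": 1},
    {"gender": 1, "age": 50, "blood": 2, "birth_date": 1969,
     "hereditary_history": 0, "smoking": 1, "alcohol": 1, "breast_pain": 0},
]

cleaned = clean_rows(rows)
```

The weight-to-age filter and the removal of constant columns are left out because the slides do not give their exact criteria.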

Augmenting the data


Changes

Values changed
  • Categorical variables were labelled and encoded as factors

Adding columns
  • Age at treatment
  • Normalised numerical variables
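
A rough sketch of the two added columns, in plain Python for illustration only. Age at treatment is taken here as treatment year minus birth year, which matches the sample rows (2019 and 1989 give 30). The slides do not say which normalisation was used, so the z-score with the sample standard deviation is an assumption, and the function names are hypothetical.

```python
from statistics import mean, stdev


def age_at_treatment(treatment_year: int, birth_year: int) -> int:
    """Age at treatment as the difference between the two year columns."""
    return treatment_year - birth_year


def z_normalise(values: list) -> list:
    """ASSUMPTION: z-score normalisation, (x - mean) / sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]


# Weights taken from the first four sample rows of the data set.
weights = [69, 71, 74, 75]
weights_norm = z_normalise(weights)

age = age_at_treatment(2019, 1989)
```

After this transformation the column has mean 0 and sample standard deviation 1, which puts age and weight on a comparable scale.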

Exploratory Analysis

Distributions

MCA Tumor type

MCA Condition

MCA Rotation

Model

Predicting tumor type


Reduced Model

  • Age (norm)
  • Weight (norm)
  • Hereditary history
  • Smoking
  • Radiation therapy
  • Menstrual age
  • Pregnancy experience
  • Abortion
  • Breast pain
model     sensitivity  specificity  balanced_accuracy
Max_pred  69%          28%          48%
Red_pred  81%          24%          53%
baseline  100%         0%           50%
Note: Positive class = Malignant
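
To make the three metrics concrete, here is a minimal Python sketch of how sensitivity, specificity and balanced accuracy follow from predicted and true labels. The labels are made up and the function names are hypothetical. It also shows why a baseline that always predicts the positive class scores 100% sensitivity, 0% specificity and 50% balanced accuracy.

```python
def confusion_counts(y_true: list, y_pred: list, positive: str) -> tuple:
    """Return (tp, fn, tn, fp) with respect to the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp, fn, tn, fp


def evaluate(y_true: list, y_pred: list, positive: str) -> dict:
    """Assumes both classes occur in y_true, so neither denominator is zero."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn)  # share of true positives that are found
    specificity = tn / (tn + fp)  # share of true negatives that are found
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Made-up labels: 6 malignant and 4 benign samples.
y_true = ["malignant"] * 6 + ["benign"] * 4

# Baseline: always predict the positive class.
y_baseline = ["malignant"] * len(y_true)

baseline_scores = evaluate(y_true, y_baseline, positive="malignant")
```

Balanced accuracy averages the two rates, so it stays at 50% for this baseline regardless of how unbalanced the classes are.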

Predicting condition


Reduced Model

  • Age
  • Gallbladder medicine
  • Menopausal age
  • Abortion
model     sensitivity  specificity  balanced_accuracy
Max_pred  89%          36%          63%
Red_pred  92%          33%          63%
baseline  100%         0%           50%
Note: Positive class = Death

Discussion & Conclusion

Discussion


  • Greedy cleaning approach
  • Disagreement between MCA and reduced model
  • A general set of rules for valid entries

Conclusion


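  • It is possible to predict both tumor type and outcome (condition).
  • Prediction accuracy is limited: balanced accuracy is 48% to 53% for tumor type and 63% for condition, against a 50% baseline.
  • Shiny app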

Bibliography

Shiny app


THANKS
FOR YOUR ATTENTION

Learning objectives


  • Explain why reproducible data analysis is important, as well as identify relevant challenges and explain replicability versus reproducibility
  • Describe the components of a reproducible data analysis
  • Use Tidyverse R to perform exploratory data analysis (EDA) for data insights, including using ggplot to visualize multilayer data from e.g. high-throughput -omics platforms
  • Use Tidyverse R to perform data cleansing, transformation, visualization and communication
  • Use RStudio and git/GitHub for collaborative analysis projects
  • Perform and interpret standard dimension reduction and clustering techniques, as well as basic statistical tests and models